How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Abstract
We investigate evaluation metrics for end-to-end dialogue systems where supervised labels, such as task completion, are not available. Recent work on end-to-end dialogue systems has adopted metrics from machine translation and text summarization to compare a model’s generated response to a single target response. We show that these metrics correlate very weakly or not at all with human judgements of response quality in both technical and non-technical domains. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
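To make the idea of correlating an automatic metric with human judgements concrete, here is a minimal sketch; it is not the paper's actual experimental setup. It assumes a clipped unigram-precision score as a simple stand-in for a word-overlap metric (the abstract does not name specific metrics) and uses Spearman rank correlation as one possible correlation measure. All function names and the toy data are hypothetical.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate against a single reference.

    A deliberately simple stand-in for a word-overlap metric (assumption).
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return matched / len(cand_tokens)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical (generated response, single target response, human rating) triples.
    examples = [
        ("try restarting the service", "restart the service and check the logs", 4),
        ("i do not know", "restart the service and check the logs", 1),
        ("have you checked the logs", "restart the service and check the logs", 3),
    ]
    metric_scores = [unigram_overlap(gen, ref) for gen, ref, _ in examples]
    human_scores = [rating for _, _, rating in examples]
    print("metric scores:", metric_scores)
    print("Spearman correlation:", spearman(metric_scores, human_scores))
```

A correlation near zero on real data would mean that the metric's ranking of responses tells us little about how humans rank them, which is the kind of finding the abstract reports.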
Similar resources
Speech understanding, dialogue management and response generation in corpus-based spoken dialogue system
This paper presents the construction of a spoken dialogue system using a large-scale spoken dialogue corpus with intention tags. In this system, all of the main components, such as speech understanding, dialogue management, and response generation, are constructed with corpus-based methods. An evaluation experiment using a test set has shown that the performance of the corpus-based dialogue system is i...
On-Line Learning of a Persian Spoken Dialogue System Using Real Training Data
The first spoken dialogue system developed for the Persian language is introduced. This is a ticket reservation system with Persian ASR and NLU modules. The focus of the paper is on learning the dialogue management module. In this work, real on-line training data are used during the learning process. For on-line learning, the effect of the variations of discount factor (g) on the learning speed...
Some empirical findings on dialogue management and domain ontologies in dialogue systems - Implications from an evaluation of BirdQuest
In this paper we present implications for the development of dialogue systems, based on an evaluation of the system BIRDQUEST, which combines dialogue interaction with information extraction. A number of issues detected during the evaluation, primarily concerning dialogue management and the representation and use of domain knowledge, are presented and discussed.
Empirical Evaluation of a Reinforcement Learning Spoken Dialogue System
We report on the design, construction and empirical evaluation of a large-scale spoken dialogue system that optimizes its performance via reinforcement learning on human user dialogue data.
Ucg Used by Response Generation
The paper deals with a spoken dialogue system component – response generation module. We are developing the spoken dialogue system called CIC (city information centre) providing a subset of services of a real city information centre. The main focus of this article is an experiment with usage of UCG (Unification Categorial Grammar) for response generation within a dialogue system speaking Czech....
Publication date: 2016